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O Abstract 

(N 

We present a two-layer hidden Markov model to detect structure of haplotypes for unrelated 

individuals. This allows modeling two scales of linkage disequilibrium (one within a group of 
■^^ haplotypes and one between groups), thereby taking advantage of rich haplotype information 

to infer local ancestry for admixed individuals. Our method outperforms competing state-of- 
art methods, particularly for regions of small ancestral track lengths. Applying our method to 
Mexican samples in HapMap3, we found five coding regions, ranging from 0.3 — 1.3 megabase 
(Mb) in lengths, that exhibit excessive European ancestry (average dosage > 1.6). A particular 
interesting region of 1.1Mb (with average dosage 1.95) locates on Chromosome 2p23 that harbors 
two genes, PXDN and MYT1L, both of which are associated with autism and schizophrenia. 
In light of the low prevalence of autism in Hispanics, this region warrants special attention. 
We confirmed our findings using Mexican samples from the 1000 genomes project. A software 



package implementing methods described in the paper is freely available at http://bcm.edu/ 
i ^i [cnrc/mcmcmcl 

> 1 Introduction 

cn 

Haplotype variation is central to statistical and population genetics. Studies have revealed that con- 
y— I siderable sharing of haplotypes exists across populations [1J, as well as significant variation among 

■^J- populations [2J. Polymorphic markers are linked on a haplotype; thus, differences in haplotype 

abundance cause linkage disequilibrium (LD, the nonindependence of marginal allele frequencies) 
between markers. Therefore, modeling LD helps to understand haplotype variations. Many statis- 
tical models exist to model LD; however, a model to detect the structure of haplotypes is missing. 
The most elegant model for LD is the coalescent with recombination 0] , or ancestral recom- 
bination graph (ARG). However, despite successful efforts on small-scale datasets [5], ARG remains 
notoriously hard to compute. Considerable efforts have been made to approximate ARG to allow 
computation on a large scale [6lll0j. Among them, the most successful is the PAC model of Li 
and Stephens |8j, which models a new haplotype as an imperfect mosaic of observed haplotypes 
to produce a conditional likelihood; the joint likelihood of all haplotypes is then approximated by 
the product of those conditionals. Along these lines, Paul and Song [TO] made formal derivation 
using diffusion approximation. A somewhat related approach is the clustering model [9], which 
summarizes observed haplotypes into a small number of (ancestral) haplotypes and models the 
observed haplotypes as imperfect mosaics of those haplotypes. 

These models assume that haplotypes are sampled from a single source population and become 
ineffective when haplotypes are admixed. Admixed haplotypes have two scales of LD: the admixture 



LD that formed between alleles in different source populations and typically spans a few to tens of 
centimorgans (cM) |llj : and the LD between alleles within each source population that typically 
spans a few tenths of a cM. The HAPMIX |12| model is among the first to model LD of admixed 
individuals, extending the PAC model to two source populations. This model is very effective for 
inferring local ancestry for two-way admixtures (e.g., African Americans), but it is not yet applicable 
to three-way admixtures such as Latinos (in principle, however, HAPMIX should work with three- 
way admixtures). Two recent examples of progress include LAMP-LD |13j and MULTIMIX [14] . 
both of which achieve similar performance with HAPMIX in inferring the local ancestry of two-way 
admixtures, and can deal with three-way admixtures. However, HAPMIX and LAMP-LD both 
require haplotypes from source populations, and LAMP-LD and MULTIMIX both assume that 
ancestries are fixed within a window of loci and only switch between windows. These methods often 
perform well for recent admixture but underperform for distant admixture, which implies limited 
ability to detect local ancestries of short track lengths. In addition, distantly admixed individuals, 
such as Uyghurs whose admixture occurred more than 100 generations ago, are valuable for disease 
association [15] and human genetic landscape studies [T6] . 

A different perspective of two scales of LD in admixture is structure on local haplotypes. Taking 
two-way admixture as an example, haplotypes from two source populations are separated into 
two groups, and a new haplotype is assigned probabilistically to a group based on its similarity 
with haplotypes in the group. Grouping source haplotypes is equivalent to putting a structure 
on haplotypes. In fact, the structure of local haplotypes is an ubiquitous phenomenon in genetic 
data, and the admixture is just a more apparent example. Even among individuals sampled from 
a single source population, a set of local haplotypes might be enriched in one subset of individuals 
and a different set of local haplotypes enriched in another. For example, individuals of European 
descents may be separated according to whether they have different two-digits human leukocyte 
antigen (HLA)-A allele classes. Compared to differences in local ancestry, the difference in two- 
digits HLA allele classes is more subtle. However, from the perspective of statistical modeling, 
these two scenarios are the same - both require detecting the structure of local haplotypes based 
on their similarities. None of the current methods is designed to handle this more delicate scenario. 

In this study, we present a novel two-layer hidden Markov model (HMM) designed to learn 
the structure of local haplotypes. The new model uses two layers of latent clusters. In each 
layer, clusters are labeled to represent ancestry alleles, and multiple clusters of the same label over 
adjacent loci represent an ancestral haplotype. In a nonrecombined region, the upper layer aims 
to capture structure near the root of a coalescent tree, whereas the lower layer aims to capture 
haplotype variation near the tip. Recombination is approximated by cluster switching within each 
layer. The lower layer clusters are fuzzy mosaics of the upper layer clusters, and haplotypes in 
the observed data are fuzzy mosaics of the lower layer clusters. The fuzzy mosaic represents the 
mode of inheritance for haplotypes: mosaic implies historic recombinations and fuzziness implies 
mutations and uncertainty of inheritance. Existing cluster-based models use single-layer clusters. 
For example, fastPHASE [9| and Beagle [17 use, equivalently, the lower layer clusters to model 
ancestral haplotypes; and the STRUCTURE [18] equivalently use the upper-layer clusters to model 
ancestry populations. Although seemingly incremental, the two-layer model has an attractive 
feature that is not available in a single-layer model - detecting structure of haplotypes. The upper- 
layer clusters represent different groups (populations) and the lower-layer clusters represent group- 
specific haplotypes. This allows us to 1) summarize local haplotypes into different groups and 
2) assign a local haplotype probabilistically into groups, which are two main ingredients for local 
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Figure 1: Structure of haplotypes. Each row denotes a SNP, and each column denotes a lower-layer 
haplotype in our model. We choose two typical regions that each contain 100 SNPs. The plot 
shows the lower clusters dosage conditional on one upper cluster (conditional dosage). Brighter 
pixels indicate higher dosages. The solid edge in the diagram indicates that the conditional dosage 
of a haplotype is either > 80% (connecting to the left upper cluster) or < 20% (connecting to the 
right upper cluster). Otherwise, we draw a dashed line indicating edge uncertainty. 



ancestry inference. 

Local ancestries of admixed individuals provide important information for disease association 
mapping [llj and demographic history [19]. It is an important subject that has attracted much 
recent attention [12H14|, |20TI22| . One way to infer local ancestry is to use ancestry informative 
markers (AIMs) - loci in which the allele frequencies have large differences among populations |23j. 
Inferences based on AIMs usually have low resolutions because AIMs are relatively scarce. On the 
other hand, haplotypes provide richer information that is complementary to the AIMs. Taking an 
extreme example, if population 1 has 50% A-T and 50% T-A haplotypes whereas population 2 has 
50% A-A and 50% T-T haplotypes, there would be no difference in the marginal allele frequencies 
between two populations. However, the two-marker haplotypes are very informative. The two- 
layer model uses local haplotypes in source populations to define population features for each small 
genomic region, and based on which admixed haplotypes are assigned probabilistically to different 
populations. These genomic regions are not prespecified; instead, they are learned from data. 
Compared to methods that group markers in windows and only allow ancestral switches between 
windows |13|. HH] , our method has better performance because prespecified windows may conflict 
with actual ancestral switches. 



2 Results 



The model and its computation are described in details in the Methods section. Briefly, each 
haplotype associates with latent states of S upper clusters and K lower clusters at each marker. 
We fit the model using the Expectation Maximization (EM) and compute the cluster dosages 
conditional on data to infer local ancestry. Let 7 denotes the admixture generation, and let A = 
IOO/7 denotes the average ancestral track length (in cM). 



2.1 Structure of haplotypes 

The two-layer model can detect the structure of haplotypes. To illustrate this, we took the Chromo- 
some 2 of unrelated CEU and YRI individuals from HapMap2 [24] and fit the two-layer model with 
S = 2, K = 10 and 7 = 100, ignoring their population labels. Then, we computed the lower cluster 
dosage conditional on an upper-layer cluster (conditional dosage) for each individual and averaged 
the conditional dosages over all individuals. The conditional dosages for two typical regions (100 
SNPs each) were plotted in Figure [II In one region, the lower clusters are split rather cleanly (but 
not evenly) between two upper-layer clusters; in the other, the lower-layer clusters are split but less 
cleanly with some lower clusters shared between two upper clusters. This example illustrates that 
the two-layer model can indeed detect the structure of haplotypes. Moreover, Figure [T] demonstrates 
that some local haplotypes are population-specific whereas others are shared between populations. 
This local haplotype sharing is an intrinsic feature of genetic data pQ. The two-layer model can 
learn this feature, which is of particular importance in local ancestry inference. As a comparison, 
local haplotype sharing is not a natural part of the HAPMIX [12] model, and a miscopy parameter 
is introduced and (somewhat) arbitrarily specified to adapt to the local haplotype sharing feature 
of the data. 

2.2 Local ancestry inference 

We first illustrate that our method can achieve exceptional accuracy in local ancestry inference. 
We simulated a three-way admixed individual (7 = 20, Supplementary Note) and fit the two-layer 
model (S = 3, K = 15) using this individual and individuals from source populations, excluding 
haplotypes used to simulate the admixed individual. Figure [2] demonstrates the comparison between 
the true and inferred local ancestries. The two-layer model can infer the local ancestry for a 
three-way admixed individual with exceptional accuracy. The loci to which inferred ancestral 
allele dosages do not match well with the truth tend to have large uncertainties. This suggests 
that when combining results over multiple EM runs, the estimates may be weighted by their 
uncertainty, e.g., inverse of variance. Note that for a diploid individual, our method can compute 
the probabilistic assignment to all possible pairs of ancestries at each marker, allowing us to quantify 
the mean and variance of the estimated ancestry dosages. The admixture proportions were also 
accurately inferred (Supplementary Fig. SI). 

Comparison with HAPMIX and LAMP-LD. Next, we compared our method with two 
state-of-art methods used in local ancestry inference: HAPMIX (for two-way admixture) and 
LAMP-LD (for three-way admixture). We used two metrics in our comparison - mean devia- 
tion and Pearson's correlation between the inferred and actual local ancestries for each simulated 
admixed individual. 

For comparison with HAPMIX, we simulated three sets (10 individuals in each set) of two-way 
admixed individuals with 7 = 10,20 and 100 (corresponding to A = 10,5 and lcM respectively). 
The difficulty in inferring local ancestry increases as the admixture generation increases. The 
results of our method were obtained with 5 = 2 and K = 10 and averaged over 10 independent 
EM runs. The results of HAPMIX were obtained using its default parameters. For easier problems 
(7 = 10,20), when both methods perform well, HAPMIX performs slightly but not significantly 
better (two sample t-test p = 0.52,0.63 for deviation, and p = 0.20,0.09 for correlation), whereas 
for harder problems (7 = 100), our method outperforms HAPMIX (p = 5x 10 -4 for deviation and 
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Figure 2: Inference of local ancestry. The plot shows the results of a typical EM run for local 
ancestry inference of a three-way admixed individual. In each panel, x-axis denotes SNPs along 
the chromosome, and y-axis denotes ancestry allele dosage. The black lines in each panel are the 
actual values, and the blue lines are inferred mean dosages. The gray bars on top of blue lines 
reflect ±2 standard deviations of the estimated mean dosages. At each locus, the y-values on lines 
of same color sum to 2. 
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Figure 3: Comparison with HAPMIX and LAMP-LD. The left panel is for two-way admixture 
(HAPMIX) and the right panel is for three-way admixture (LAMP-LD). Results of our method 
are on the x-axis, and results of other methods are presented on the y-axis. Each point in the plot 
represents a simulated admixed individual whose local ancestry track lengths are 10, 5 and lcM 
as shown in the legend. Black points denote mean deviation, and blue points denote Pearson's 
correlation between the inferred and actual local ancestries. 



p = 2 x 10~ 5 for correlation) (Figure pj). Our method has some practical advantages over HAPMIX: 
1) it cleanly handles missing data, whereas HAPMIX does not allow missing data; and 2) it can 
directly work with genotype data, whereas HAPMIX requires haplotypes from source populations. 
When the phasing of individuals from source populations is imperfect (e. g., statistical phasing 
without the help of transmission) , our method has an advantage. 

We compared our method with LAMP-LD for three-way admixed individuals. Similar to the 
comparison with HAPMIX, we simulated three sets (10 individuals in each set) of three-way ad- 
mixed individuals with 7 = 10, 20 and 100. The results of our method were obtained with S = 3 
and K = 15 and averaged over 10 independent EM runs. The results of LAMP-LD were obtained 
with default parameters. Similar to the comparison with HAPMIX, for harder problems (7 = 100), 
our method outperforms LAMP-LD (deviation p = Q x 10 and correlation p = 2 x 10 -8 ). For 
easier problems (7 = 10 or 20), both methods perform similarly if measured by deviation {p = 0.69 
or 0.67). There is a marked difference in performance if measure by Pearson's correlation - our 
method outperforms LAMP-LD (p = 0.01 for 7 = 10 and p = 3 x 10~ 3 for 7 = 20) (Figure^. 
A closer look revealed that LAMP-LD tends to make more mistakes on small regions of a few 
hundred SNPs (Supplementary Fig. S2). We suspect that this has to do with grouping markers 
into windows, even though the recommended window size (50 — 100 SNPs [T3]) is smaller than the 
size of often misidentified regions. In addition, LAMP-LD appears to be very certain everywhere, 
which can be misleading. 

2.3 Excessive local European ancestry in Mexicans 

We applied our method to infer the local ancestries of Mexican samples in both HapMap3 [25] and 
1000 genomes (1000G) projects [26]. 

HapMap3 Samples. We used 112 individuals from CEU and 147 individuals from YRI 
in HapMap3 and 35 individuals from Mayan and Pima in the Human Genetic Diversity Panel 
(HGDP) [27J as three source populations (denote as SP1) to infer the local ancestry of 58 Mexican 
samples from HapMap3 (all genotypes are diploid). We fit the model with S = 3, K = 15 and 7 = 
10, 20 or 50 on Chromosome 2. The mean ancestry proportions for CEU, YRI and Native Americans 
are 0.517,0.048 and 0.435, respectively, in line with what has been reported by others |14| I19j. In 
examining local ancestral allele dosages, we found a 1.1Mb region that contains excessive European 
ancestry (Figure HI). Note that because 1.1 Mb is roughly equal to l.lcM, the result obtained using 
7 = 50 has the strongest signal. Interestingly, this region harbors two genes, PXDN and MYT1L, 
both of which are associated with autism and schizophrenia |28| [29] . Given that the prevalence of 
autism is lower in Hispanic children compare to other ethnic groups [30, 31J, further studies are 
warranted to clarify whether and how this region contributes to the lower prevalence of autism in 
this population. 

Encouraged by this finding, we further analyzed all autosomes and identified four additional 
genetic regions (Supplementary Fig. S3) that exhibit excessive European ancestry dosages (EAD). 
We chose EAD of 1.60 as a cutoff because it represents roughly 6 standard deviations away from the 
mean based on a conservative estimate assuming binomial sampling. Noticeably, 1) all five regions 
are short, ranging from 0.30 to 1.26Mb; 2) all are coding regions, many genes in these regions are 
immune related; and 3) these regions contain multiple hits from genome-wide association studies 
- undoubtedly, our findings will help to interpret results and to design follow-up studies. It is 
worthwhile to note that because these regions are short, only a method with high resolution can 
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Figure 4: Excessive local European ancestry in Mexicans. The y-axis is the average European 
ancestry allele dosages over 58 Mexican samples. Black segments indicate SNPs. The inlet is a 
close-up of the peak. The peak region is in 2p25.3, ranging from 1.59Mb to 2.70Mb; the average 
European allele dosage is 1.95. 



detect them. (This might explain why this interesting phenomenon has never been reported before.) 

1000G Samples. To double check our findings, we analyzed Mexican samples in the 1000G. 
Using identity by state, we identified that 29 of the total 66 samples overlap with HapMap3 Mexican 
samples. For SNPs that are typed in both projects, there is a high genotype concordance for all 29 
samples (average Hamming distance < 0.002). We inferred the local ancestries of these 66 samples, 
using haplotypes of CEU and YRI in 1000G and genotypes of Maya and Pima in HGDP as three 
source populations (denote as SP2). We found that: 1) four among five regions were also discovered 
using these 66 samples; the EAD of the region on chromosome 12 dropped from 1.72 to 1.49, and 
EAD of a region on chromosome 18 increased from previously 1.50 to 1.60 (see Supplementary 
Fig. S3 for the region); 2) among 29 overlapping individuals, the inferred admixture proportions 
have a high concordance between two choices of source populations SP1 and SP2 (Figure ph. 
Because we used unphased CEU and YRI in HapMap3 as source populations (SP1) for HapMap3 
Mexican samples and used phased CEU and YRI in 1000G as source populations (SP2) for 1000G 
Mexican samples, this high concordance suggests that the phasing of CEU and YRI in 1000G is 
reliable; and 3) the 37 non-overlapping individuals in 1000G have an average smaller European 
ancestry proportion of 41.9% compared to 56.6% of those 29 overlapping individuals (Figure pi), 
and this difference is unlikely caused by random sampling (permutation test p < 0.004). 
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Figure 5: Admixture proportions of 1000G Mexican samples (Chromosome 2). The left plot shows 
the concordance of 29 overlapping individuals genotyped in HapMap3 and 1000G Two points that 
belong to a same individual are connected by a short segment. The right plot shows the remaining 
37 Mexican samples in 1000G Each individual has inferred admixture proportions, a triplet (x, y, z) 
with x + y + z = 1 . An unique point can be determined when each component represents distance 
to an edge of an equilateral triangle. 

Since 1000G provides phased haplotypes for Mexicans, we therefore inferred the local ancestries 
of these haplotypes, using three source populations SP2. We found that the inferred local ancestries 
have excessive ancestry switches compared to those using unphased genotype data (Figure [6]). These 
excessive switches are likely caused by imperfect phasing - when using diploid genotypes our method 
integrates out phase uncertainties. Phasing admixed individuals is a hard problem. Our results 
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Figure 6: Comparison between phased and unphased 1000G data. The plot shows the inferred 
European ancestry allele dosages (y-axis) of a typical Mexican individual. The x-axis denotes 
SNPs. The blue (pink) line denotes inferred values using unphased (phased) 1000G data. The 
excessive ancestry switches of pink line indicates imperfect phasing. 

suggest, from an indirect angle, that there is room for improvement in this area and we anticipate 
the two-layer model to make meaningful contributions. 

3 Discussion 

We have presented a two-layer HMM to detect structure of local haplotypes, and demonstrated 
its utilities in local ancestry inference. Our method can directly work with diploid data and 
thus eliminates phase uncertainty that often plagues other methods. The prevailing model for 
admixture is one pulse model, meaning haplotypes from two source populations mixed once some 
generations ago and continue to admix afterwards without influx of additional haplotypes from 
source populations. In reality, however, this assumption is overly simplified. Treating the mixing 
generation as a parameter, the two-layer model can average results over multiple choices of mixing 
generations. This makes our method applicable to the scenario of continuously mixing, which is 
perhaps a more realistic model for admixture. More importantly, our method has a high resolution 
- owning to flexible choices of mixing generation, it is able to detect ancestry segments of lcM or 
smaller as demonstrated in the Mexicans data analysis. 

Because structure of haplotypes is an ubiquitous phenomenon in genetic data, the two-layer 
model has many other potential applications. 1) Using lower cluster dosages we can compute 
pairwise local haplotype sharing (LHS), defined as the probability of two haplotypes descent from 
a same lower cluster, which reflects genetic relatedness among haplotypes. Preliminary studies 
suggest that LHS can be used to impute HLA alleles and detect genetic associations. 2) As the 
two-layer model can infer the local ancestry with high accuracy, it is reasonable to speculate that it 
will also be effective in genotype imputation and phasing for admixed individuals. 3) Our method 
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can directly estimate cluster-switch rates between adjacent markers, and this permits the inference 
of recombination rates and hotspots, which will be particularly useful for admixed individuals. 
4) Aggregating is an effective method for detecting rare variants associations [32]. For admixed 
individuals, it would be helpful to aggregate rare variants of the same local ancestries. 

Last, because a diploid individual has two sets of latent states (one for each haplotype), our EM 
algorithm is quadratic in both S and K, and linear in numbers of individuals and markers. This 
potentially limits the two-layer model's applicability. For local ancestry inference, when one can 
use phased data in source populations, the computation is fast because for a haploid individual, our 
EM algorithm is linear in S and K. Finding an appropriate linear approximation to fit our model 
for diploid individuals is a challenge and we are actively investigating this problem. The recent 
progress concerning linear algorithms to fit the PAC model [33j is extremely encouraging. Note 
that this quadratic computational challenge might disappear in the near future due to the recent 
development of methods such as phase-seq [34], which produces genomic sequences completely 
phased across the entire chromosome. 
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Methods 

For ease of presentation, we assume that haploid individuals are being observed. Treating a diploid 
individual as having two random haploids subject to genotype constraint, our model applies directly 
to diploid individuals (Supplementary Note). 

The two-layer HMM 

We assume the numbers of clusters in upper and lower layers, S and K, respectively, are fixed and 

(i) (i) 

the number of haplo types is iV and number of markers is M. For each individual i, let X m ,Y m 

(i) (i) 

be the latent state of the upper and lower clusters at locus m. Here X m and Y m take integer 

values to denote different clusters; each lower (upper) cluster associates with a parameter 6 m . (r/ m .), 

representing ancestral allele dosages. We may drop the superscript when referring to an arbitrary 

individual. 

The main HMM. The emission of an observed haplotype marker from a lower layer cluster 
is modeled as 



p{ti m |A TO , Y m ,c,)—p[h m \Y m ,t,) 



if h m = 1 
1 - v(i) if h m = . (1) 



mY® 



mY' 



1 if h m is missing 



N M 



The complete data likelihood has the form 

i=l m=l 

The transition of the Markov chain on latent states is modeled as 

r\J\-rn = S, Y m — rC\^-m~ 1 = S , Y m — \ = K ) 

= imCtsPmsk + (1 - jm)r m f3 msk I(s = s') + (1 - j m )(l - T m )I(s = s')I(k = k') 

where I(a = b) is an indicator function and j. and r. are cluster-switch probabilities for the upper 
and lower layers respectively; and 

p(X l = s, Y 1 = k) = p(Y 1 = k\X 1 = s)p(X l =s) = a s p lsk , (4) 

where a. is an individual specific S- vector to denote the admixture proportion, and f3 m .. is an S x K 
matrix shared by all individuals. 

We made three assumptions on the transition matrix of the hidden states. First, conditional on 
switch, the cluster to land at locus m is independent of cluster at locus m — 1. This assumption, 
used by previous models [HI [9], reduces the number of parameters and simplifies computation 
compare to specifying a whole transition matrix. Second, conditional on switch, the cluster to land 
is homogenous across loci and individual specific at the upper layer but heterogeneous across loci 
and shared among individuals at the lower layer. The assumption on the upper layer makes a. 
naturally admixture proportions; the assumption on lower layer accommodates two facts: 1) LD 
patterns are heterogeneous across loci and 2) LD is a group property. Third, we assume that if 
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the upper layer switches, then the lower layer must switch, and the lower layer might switch if the 
upper layer stays. This encourages upper layer-specific LD patterns. 

In the context of the local ancestry inference, the aforementioned transition explains how the 
two-layer model captures two scales of LD. A haplotype copies mosaically from (ancestral) hap- 
lotypes in one source population, and then may switch to another source population and copies 
mosaically from its haplotypes. The upper-layer switch probabilities j. determine how frequently 
switches occur between different source populations and the lower layer switch probabilities r. de- 
termine how frequently switches occur between ancestral haplotypes within each source population. 
Thus, the model can accommodate the two scales of LD observed in admixed individuals. 

In the main HMM, the upper layer latent state X m only contributes to transitions of latent 
states (through /3), not to emitting a marker or 9 estimates (likelihood involves no rf). This works 
well when K is not too large, but it works less well for large values of K. To stabilize the estimates 
of 9 for large values of K, we use an ancillary HMM to model r\ emitting 9. 

(k) 

The ancillary HMM. We model emission of 9 mk from an upper-layer cluster Wm as 

p(9 mk \W^,0 = Beta(9 mk ;F VmwLk) ,F(l-ri mwLk) )), (5) 

where Beta(x; a, b) denotes a Beta density with parameters a, b. This emission is adapted from the 
Balding-Nicols model [35]. The original model is designed to model population divergence, and 
hence, F is specified through Fst values between different populations. In this situation, we use it 
as a "random effect model" to stabilize 9 estimates. For computational convenience, we set F = 1 
(Supplementary Note). We treat 9 mk . as observed, and the complete data likelihood has the form 

K M 

p(0.i, . . .,9.K, W^, ..., WW\£) = [] II P(Omk\wW,0 p(Wg>\t). (6) 

fc=l m=\ 

The transition of the latent states is modeled as 

p{w[ k) = s) = a« 
p(WW = slW^l, = s') = p m a« + (1 - Pm )I{s = s'). 

We assume the jump probabilities p. are unrelated to the jump probabilities of j. and r.. 

Model fitting 

In the main HMM, the collection of parameters £ contains 9 (an M x K matrix), /3 (an M X S X K 
matrix) , a (an N x S matrix), and j and r (both M vectors). In the ancillary HMM, the set of 
parameters contain r\ (an M x S matrix), a (a K x S matrix) and p (an M vector). We briefly 
discuss how to estimate these parameters using Expectation Maximization (EM), focusing on the 
main HMM. For details, please refer to the Supplementary Note. 

For an arbitrary individual i, we write the forward probability (p(m,s,k) = p(hx- m ,X m = 
s,Y m = k\£) and backward probability tp(m,s,k) = p{h m+ i-M\X m = s,Y m = fc|£). Both the 
forward and backward probabilities can be computed analytically. Then we obtain the posterior 
probability of latent states at each marker p(X m = s, Y m = k\h, £) = </>(m, s, k)ip(m, s, k)/p(h\£) for 
each individual. This allows us to compute quantities to update each parameter. Similarly, we can 
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also update parameters in the ancillary HMM. Note that both h and r\ contribute to updates. 
Upon convergence of EM, we have £*. 

Constraint on cluster switches. Estimating switch probabilities j and r is trickier than 
others for two reasons. First, their updates are sticky, as a large j m (or r m ) estimate in a previous 
iteration often results in a large estimate in the current iteration. Because of this stickiness, the 
choice of initial values of j and r heavily affects the point at which they converge. Second, j and 
r are not completely identifiable. If a s j3 ms k concentrates on a particular cluster pair (s,k), then a 
large probability in either j m or r m results in a similar likelihood. We overcome these difficulties 
by putting constraints on j and r - the coalescent intuition behind the two-layer model come to 
assist. 

Let r m = 1 — exp(— t m ,), where t m = 4iV e c m 61 is the lower-layer cluster switch rate, where iV e 
is the effective population size and c m is the genetic distance between markers (m — 1) and m. We 
assume that c m is known. In practice, we approximate it using lmegabase(Mb) per cM. Recalling 
that Tp. = l/( 2 ) is the mean coalescent time for k lineages, we have 61 = Tk+i + • • • +T/v = -g — ■»?. 
Assume N » K, and we have 61 ~ 1/K. This leads to a natural choice for constraint on t m (and 
hence r.). For example, if N e = 10, 000, ^ c m = 5cM and K = 10, then we may apply constraint 
Y^tm = 500. In practice, we directly estimate r m and compute t m , and then we rescale t m to 
match the constraint and reestimate r m . 

Let j m = exp(— tm ), where tm = 4iV e c m 5 U . We are tempted to follow a similar coalescent 
argument to specify 5 U . However, unlike 5i, which is robust to recent demographic history because 
it pertains to an "ideal" ancestral population, 5 U is heavily influenced by demographic histories 
(for example, admixture generations) and the coalescent argument becomes ineffective. To work 
around this issue, we constrain j. through the admixture generation 7. In practice, we first estimate 
j m and compute tm > and then we use Y^ t™, = 7 Yl c m to rescale tm to reestimate j m ■ Define 
ancestry track length A = Mj Y2 tm, then A and 7 follow a simple relationship 7 A = 100. 

Inference and computation. We are interested in the upper cluster dosage P(X m '\h^\^*) 
for each individual i, which represents the local ancestry; its average represents the admixture 
proportion. After trial and error, we arrived at the following model-fitting tricks that improve 
performances: 1) Because the dimension of £ is high and standard EM procedures tend to converge 
to a local mode instead of the global mode, it is useful to average inferences over multiple EM runs. 
2) It is helpful to initialize parameters with values that best preserves symmetry, e.g., 6 m k ~ 0.5, 
cr*) ~ 1/S, and /3 ms fc ~ 1/K for all values of m, s and k, respectively. The initial values can be 
simulated from symmetric Beta or Dirichlet distributions with large rates. 3) The training data 
from source populations can be either phased or unphased. The difference is small when phasing 
is accurate and the computation with phased data is faster (linear vs. quadratic in terms of S and 
K). However, when phasing is less accurate, for example, pure statistical phasing without help of 
transmission, using unphased data is preferred. 4) The common practice used in imputation [36], for 
which one first fits the model to the training data from source populations and then runs forward- 
backward algorithm once on the admixed individuals, tends to produce spurious ancestry switches 
in spikes. Performing additional EM steps using both training data from source populations and 
admixed individuals (joint model fitting) reduces spurious ancestry switches. We recommend joint 
model fitting. 
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5 Supplementary Note 

5.1 Model Motivation 

Suppose we observe N haplotypes hS 1 ', . . . , h^ N '. For a nonrecombined region, these N haplotypes 
relate to one another through coalescent trees (Fig. Nl for a typical r). We are interested in 
computing a joint likelihood p{lv- l \ . . . , hS N '\£), where £ is a collection of parameters to be specified 
later. Traditionally this is done by first sampling a set of r and then conditional on each r 
integrating out ancestral state 9 using a "peeling algorithm" [c.f. ? ]. This computation involves 
sampling trees and this process can be extremely slow (and becomes much slower when considering 
recombinations). To make the joint likelihood computable for modern genetic data (e.g., genetic 




Figure Nl: Approximate coalescence. A coalescence tree r on the left is cut at two different levels. 
The subtrees below each cut are approximated by star-trees to obtain k on the right. 

data from genome- wide association studies), we approximate the coalescent tree with a simpler 
structure - two layers of clusters (not counting the root and leaf layers, Fig. Nl). We make two 
cuts on a tree and group nodes according to subtrees to form clusters. The first cut is near the 
tip and individual haplotypes are grouped into lower clusters; the second cut is near the root and 
the lower clusters are grouped into upper clusters. Because a coalescent tree is a random tree, we 
allow each lower cluster to have positive probabilities of connecting to any upper clusters, and each 
individual haplotype to have positive probabilities of connecting to any lower clusters. Thus, n 
can represent not just one but many realizations of coalescent trees. We learn those probabilities 
from the data. Cutting at different places will result in different numbers of clusters. In our HMM 
model, we prespecify the number of clusters in each layer and then cut the tree at places to match 
the number of clusters. 

Approximating r with k allows us to condition on ancestral states integrate out n, and then 
integrate out the ancestral states. Because k is more regular, it can be represented by latent states 
and recombination can be conveniently approximated by cluster switching within each layer. These 
give us a two-layer HMM that greatly simplifies computation. 

5.2 Choice of parameters 

The companion software is easy to use - users only need to specify three parameters: S, K and 7. 
For local ancestry inference, S usually is clear a priori. We used K = 5S in our study, but the 
method is robust to a wide range of K values. We demonstrate this through examples. For a set 
of simulated two-way admixed individuals, we used S = 2 and K = 5, 10 or 20 to fit the model. 
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o K=1 deviation 
v K=20 deviation 
o K=1 correlation 
v K=20 correlation 



Figure N2: Comparisons for different K. The plot shows comparisons between K = 5 (on the 
x-axis) and K = 10 or 20 (on the y-axis). Black color denotes deviation and the red color denotes 
correlation between inferred and value and the truth along a chromosome. 

The results indicate that K = 10 or 20 outperforms K = 5, especially when using correlation as a 
metric. However, the difference between K = 10 and K = 20 is small (Fig. N2). 

As a rule, we recommend averaging results over multiple choices of 7. However, in general, 
7 = 10 for African American samples, 7 = 20 for Latinos, and 7 = 100 for Uyghurs appear to 
be good choices. In our simulation studies, the local ancestry inference is robust to 7 up to a 
multiple of 2. However, 7 affects the smoothness of the local ancestry inference. This can be 
best demonstrated through examples. We simulated two-way admixed individuals with admixture 
generation 7 = 100 and fit the model using 7 = 50,100 and 200 respectively (Fig. N3). For all 
individuals, we found that small values of 7 produces smoother local ancestry estimates than large 
values of 7. Nonetheless, for all three choice of 7, the main ancestry blocks are well inferred. Taking 
one individual as an example, the three deviations for three choices of 7 are 0.067, 0.062, and 0.092, 
respectively, with 7 = 100 performing the best and 7 = 200 performing much worse than the 
other two, presumably because the metric is more sensitive to smoothness. In fact, if measured by 
Pearson's correlation, we obtain 0.934, 0.947, and 0.932 respectively. As a comparison, the deviation 
for HAPMIX is 0.067 and the correlation is 0.939. Although similar quantitatively to our method, 
HAPMIX does miss a major ancestry block in the middle (Fig. N3). 



5.3 Simulate admixed individuals 

The procedure we used to simulate three-way admixed individuals is similar to what is used in 
HAPMIX [p2] for two-way admixture. For a given admixture generation 7, we compute the average 
ancestral track length A = IOO/7 and then compute t = 1000A (a region of 1Mb contains approx- 
imately 1000 HapMap SNPs). We randomly choose three haplotypes, h c ,h y and h a , from CEU, 
YRI, and CHB+JPT, respectively, and then copy from the three haplotypes to form a new admixed 
haplotype by repeating the following three steps: (1) let s be the current position and generate a 
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Figure N3: Comparison for different g. The x-axis denotes genetic markers along a chrosome, the 
y-axis is inferred allele counts at each marker of an arbitrarily chosen ancestral population. The 
black lines are the simulated truth; the red lines are inferred values with different choice of mixing 
generations; the blue line are the results of the HapMix as a comparison. Each column denotes 
an individual, 7 = 50, 100, 200 from top to bottom panels. The individual on the left is explained 
in the main text. For the individual on the right, the deviation error are 0.053,0.054,0.077, and 
correlations are 0.941,0.946,0.939 respectively. As an comparison, HAPMIX has deviation 0.086 
and correlation 0.866. 
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number w according to an exponential distribution with mean t; (2) copy SNPs (s, s + w] from h a 
with probability <5i, from h y with probability 62, and from h a with probability £3 = 1 — 81 — 62; (3) 
increase s by w, finish if s exceeds the total number of SNPs. Two admixed haplotypes are paired 
randomly to form a diploid individual. The markers are then thinned to match the Illumina 650K 
SNP chip. The two-way admixture can be simulated accordingly. We chose (0.8, 0.2) as the target 
two-way admixture proportions and (0.6, 0.2, 0.2) for three-way admixture proportions. Note that 
the simulated admixture proportions varies due to a finite number of SNPs. 

5.4 Expectation Maximization 

First outline the EM algorithm. Make an initial guess of parameters £*, and write down the 
complete data likelihood, denoting Zm = (Xm , F™ ), 



n M 



J (^ i ) J ...,/ l (") J z( i ),...,z(")|o = nn^mi^^op(^ ) i^m-i,op(M w.rMsne 



i=l m=1 



Then estimate a new £ by 

argmax ? E zlhW _ h(n) ^ [logp^ 1 ), . . . , fcW Z« . . . , ZW\£) 



(8) 
(9) 



Set £* = £ and iterate the procedure until £* converges. 

To elaborate on the EM algorithm: conditioning on £*, the posterior distribution of p(Z^'\h^', £*) 
can be computed for each i. To estimate £, one can either sample many paths from p{Z^ l '\h^ l \ £*) 
(hard EM), or one can integrate out p(Z^\h^\^*) analytically (soft EM). Intuitively, soft EM 
will perform better because it does not introduce sampling variation. However, with hard EM 
only forward probabilities need to be computed in order to sample from p(Z^ l '\h^\^*). More im- 
portantly, computational tricks may be applied on the sampled paths to avoid possible traps of 
local optimum. In the paper we use soft EM for model fitting and report possible computational 
improvement elsewhere. 

A diploid individual has two sets of latent states at each marker Z^ = (X^, Y^), Z"^ = (X^, Y^) 
to indicate the upper and lower layer cluster membership (we drop the super script for individual and 
this should cause no confusion). The conditional likelihood for i-th. individual is p(g^>\Z l , Z 2 ,^) = 

Ylm=iP(9m \ Y L Y L€) with "emission" 



(10) 



where 









tjt k 


if <7m = 2 


pIq^IY 1 


= jY 2 = 


= k,o = < 


tj(l - t k ) + (1 - tj)t k 
{l-tj)(l-t k ) 
, 1 


if gin' = 1 

if g$ = 

., (i) . 

it gin is missm 






tj = m j{l — /i) + (1 — 6 m j)fi 








tk = &n 


lk (l - fi) + (1 -0 mk )n 





(11) 



and \x = 4iW is the scaled mutation rate. In the implementation we used [i = 0.001. Note the 
one-to-one correspondence between t. and m ., and that we implicitly assumed Hardy- Weinberg 
equilibrium in the emission. 
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5.4.1 Forward and backward recursion 

In what follows, every probability statement is conditional on £*. We assume there are M biallelic 
SNPs. The forward recursion. 

<f)(m + l,s 1 ,k 1 ,S2,k 2 ) = P \9iL+n z L+i = ( s ii k i), Z m+i = (s 2 ,k 2 )\C) 
= p{gS +1 \Z m+1 ) Y^ <f>{rn, s[, k[, s' 2 , k' 2 )p (Z^ +1 \Z^ = (si, fci)) p (Z^ +1 \Z^ = (s' 2 , k' 2 )) ( 12 ) 

s'.,k'. 
= P(9m+l\ Z m+l) (Jm+l PU + jrn+liX ~ jm+l)(PlO + POl) + (1 ~ jm+lf Poo) , 

where 0(1, s 1} fci, s 2 , k 2 ) = a^ , Pi, sl ,ki<Xs2 , Pi,s 2 te p(Si \si,h,s 2 ,k 2 ) and 

Poo = (1 - r m+1 ) 2 (j)(m, si, fci, s 2 , k 2 ) + r^ +1 ^ 0(m, si, fci, s 2 , ^2)/3 m+ i, Sl ,fc 1 /3 m +i, S2 ,fc 2 + 

(13) 

l+ l(l - r m+ i) I ^ 0(m, 51, fc 1; 32, fc2)j0m+l,si,fei + ^ 0(ra, Si, &!, «2, fc 2 )^ m+ i, S2 ,fc 2 

\ fc l fe 2 / 

Pio = ai l 1 ) ^m+i,si,fei (fm+i ^ 0(m, si, fci, 5 2 , fc 2 )/3 m +i,s 2 ,fc 2 + (1 - r m+ i) ^ 0(m, si, fci, s 2 , k 2 

Q f y y q' y 

(14) 

POl = (XsiPm+lMte (r m+ l Yl ^( m > Sl ' fc l' S 2' k 2)Pm+l,si,ki + i 1 ~ r m+l) ^ </>(m, Si, fci, S 2 , fc 2 )J 

c' y y c ; t*' 

(15) 

pn = a W P m +\, Sl M a *lPrn+l,S2M ^Z ^( m > 5 i' fe 'l' S 2> k ' 2 ). ^ 

«' J-' «' V 

All summation with dummy variables s',i' only needs to be done once. This is the benefit of 
the parameterization for Markov transition described in the paper. The overall complexity of the 
forward and backward recursion is 0(MS 2 K 2 ) for diploid individuals and O(MSK) for haploid 
individuals. 

The backward recursion. Note p(g^. M \si, fci, s 2 , k 2 ) = ip(m, si, fci, s 2 , k 2 )p(g}n\si, fci, s 2 , fc 2 )- 

^(m-l,Sl,fcl,S 2 ,fc2) =p(#m!Ml Z i-l = ( 5 l) fe l))-^m-l = (*2,fe 2 )|C) 

= £ ^(m, si, fci, s' 2 , fc 2 >( 5 «|si, fci, s 2 , fc 2 )p (Z^ = (si, fci)^) p {Zl = (s 2 , fc 2 )|Z 



2 \ 
m—lj 



c / ju/ c / t,/ 
*l' /t l'' , 2'' t 2 



= (j'm 911 + JmO- ~ jm){qiO + 001 ) + (1 ~ jm) 2 <?Oo) , 

(17) 
where ^>(M, si, fci, s 2 , fc 2 ) = 1 and 

900 = d^ /3m, Sl ,fci/3m, S2 ,fc 2 P(ff2Ml S l' fc l' S 2, fc 2 ) + (1 - FnOVflfijlf l m > s l> k U «2, h) 

y y 

11 2 r a) a) i (18) 

+r m {l-r m ) x [^/3 miSli ^p(5„; M |si,fci,s 2 ,fc2) + ^/3 m , S2i ^p(^; M l' s i' A; i' s 2,fc2) 
fej fc 2 
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?10 = r m Yl a y i ) ^n»,«' 1 ,fc' 1 ^m > a a ,^P(fl , m:Ml s i' *!> S 2' ^) + (l-^m) X] "i'^m^.fe^^M l«l. *!> s 2 



c' t* ; t*' c' y 



(19) 

901 =r m Yl a S 2 ) ^m,«i,^^m,«i,^iP(fl , m:ilfl 5 l' fc l> s 2) ^)+( 1 -^m) X] "^/W^KsSmI*!' *1> *2> 4) 

(20) 

911= 5Z "i'^m^^^ai^m^^fc^^rn-MNl'^i' 5 ^^)- (21) 

«' J-' «' t-' 

The posterior of latent states at each locus for each individual can be computed via 

(22) 



V\Z)n = (si,ki),Z^ = {s 2 ,k 2 )\g ( - t \Cj « </>(m, s±, h,s 2 ,k 2 )ip(m,s 1 ,ki,s 2 , k 
and rcnormalize to have £ sljfeljS2j fc 2 P ( Z m = ( s b fc i)>^m = ( s 2, k 2 )\g ( - l) ,C) = l - 

5.4.2 Update 9 

To update parameters in each EM steps, we solve for each component x of £, 

d 



2 J 



^iw 



logp(h,g,zW,...,zW\£) =0. (23) 



(24) 



Assume we have both diploid g and haploid h individuals in our data. For diploid individuals, 
at locus m, write p ijfe = Y. Sl ,s 2 P{ Z m = ( s iJ)i Z m = 0»2,fc)l5 W >£*)- Let ^ = {i : s4V = A;} for 
/c = 0,1,2. Similarly, for haploid individuals, at locus m, write qij = ^2 s p[Z m = (s,j)\hm,£*)- 
Let Tfc = {i : hm = k} for k = 0, 1. Let 

a oj© = 2^ Pijfc) Q Qjj = /^PiJh 

i&S ,kytj ieSo 

a 2j© = zJ ^' fc ' a 2j'i = 2J p W' 

i&S 2 ,k^j ieS 2 

ai i k = / , Pijfe) a ljj = / , Pijj ; 

S' = zZ ^'' &1 j = zJ *•?• 

iST i6Ti 

Take derivative with respect to 9 m j and sum over A; for diploid individual to get 

F i^ = ■7-—{(iQ j& + 2ao jj +a ljj + bo j ) + - (a 2jQ + 2a 2jj + aijj + &i j ) + ^ _ * aij fc = 

(25) 
for each j = 1 . . . , K (recall K is the number of lower-layer clusters) . We have K equations with 
K unknowns and we can solve numerically for tj and hence 9 m j- To do so, we need the Jacobian 
J(t.) = (djk) where 

dF — 1 

*>" = ^ = (t 3 + t k -2t 3 t k )^ k fOT k * * (26) 
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and 

-1 -1 v-^ (1 - 2t fc ) 2 

d 3J - Ty _ t \2 ( a °i© + 2a °n + ai ii + %' ) + ^2" ( a 2i0 + 2a 2j j + a ijj + &ij ) + 2_> ( t . + \ - 2t t ) 2 aijk ' 

(27) 
We can solve «/(*(")) (t( n+1 ) - t^) = -F(*( n )) for the unknown t( n+1 ) - *( n ). 

Compare to the update used in [9] , this update for 8 does not directly involve its previous value. 
Perhaps unwilling to solve a linear system repetitively, Scheet and Stephens [9] used approximation 



to the last terms of Equation ( 25 ) 



£ tj lC% tk ai '» - £ i}-^ - rV*) ' <28) 



where 

tj Ttfc ^tjtk tj + tfc Ztjtfc 

which can be computed by approximating tj and £& with values in the previous iteration. Denote 



/ lj\i.-l k ) „ _ l k {± - Lj) 

a ijk - ,. , . _r,.. a ijk, a ljk - - | ^ ^T77 a ijfe, i^yj 



"i- - 2_^ Gl i fc ' ai i© ~~ 2-^i a 



lie ~ Z^ "life' Ul i© ~~ Z_> "life' (30) 



and we have, 



-1 

- 1 
and solve to get 



■FjC*©) = 1 — (aoj© + 2a 0j j + Oyj + 6 j + a![ j& ) + — (a 2 j + 2a 2 jj + aijj + hj + a'ij ) = (31) 



1-tj -^ " JJ " J " J ^' £j 



(a 2 j + 2a 2j j + o^g + aijj + &ij) 

( a 0j© + 2aojj + ay© + 2aijj + a2j© + 2<X2jj + &oj + &lj) 

With ( 32 ) as a starting point only a few iterations are needed to estimate 9 using numerical method 
described earlier. Note however, solving the linear system has complexity 0(K 3 ), which makes the 
complexity of model fitting to be 0(max (MS 2 K 2 , MK 3 )). 

5.4.3 Update Markov transition parameters 

To estimate Markov transition parameters, following [9], we introduce latent state transitions 
(jumps) Ji m and Ri m occured between locus m — 1 and m at upper and lower layers for indi- 
vidual i. Denote Ji ms the number of upper layer jumps to X m = s, and Rimsk the number of lower 

(i) " (i) 

layer jumps to X m = s and Y m = k. Recognize that Jims and Rimsk are sufficient for a,/3,j and 
r, we have 



a 



(') 



Pmsk 



EXL*E.E[J im .\g®,?] 

J2iE[Rimsk\g {i) ,e} 



E l ,k E i^msk\g^,^} (33) 

. _ Zi,sE[Jims\9^,C] 

Number of haploids 

Zi,., k E[R*nsk\9®,e] 

Number of haploids x S ' 
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where recall that S is the number of upper-layer clusters. 

In what follows, when a state in forward or backward probabilities was substitute by a dot, it 
means that component was summed over. Note that p(g^ l '\^*) = (j)(M, -, •, •, •) and 

P(9rn:M\ S ' fc l> S 2, fa, C) = Pidmlsi, fa, S 2 , fa,C)^(™, «1, &1, «2, fa)- 

First E[J ims | 5 W,e] = 2p(J ism = 2|<7«,e) + p(J* m = l|ff (i) ,D with 
2p(J ism = 2\g^,C) = 2(a^j m ) 2 / P (g^\C)x<t ) (m-l,;;;-) £ ^^p&I^M.fe.f), 

fcl,fe2 

(34) 
p(J sm = l|5 (i) ,r) = J'm(l - jm)a«/p(5 (0 ir)x 

[(l~»"m) X] ^( m_ 1 r,-,S2,fc 2 )^/3 ms fc 1 p(5m:Ml S ' fc l' S 2,^2,r) + 



and 



«2,fc2 fel 

r m ^2 ^ m ~ 1; '' '' S2 ' ') X] ^2fc 2 / 3 m^iP(5 , m:Ml S ' fe l) S 2, &2, £*) + 

(l-r m ) ^ 0(m- l,si,A:i,-,-)^/3 m sfe 2 p(5'^Ml' s i' fe i'' s ' fc 2,r) + 

Sl,fcl &2 

7"m ^ ^( m ~ 1 ' Sl > "' "' ') X] &nsifci#msfc 2 P(0m : Af l 5 l' fcl ' S ' ^ ^* 



Sl fci,fc2 



Second, 



E[R %msk \g {l \e] = MRimsk = 2,J ims = (%«, O +P(^msfc = 1, Jim- = <%« O 

+ pyitimsk = t, Jims = J- 1 C/ ,q J, 

with each component being 

2p(i4„ sfe = 2, J jms = 0| 5 »,D = 2(1 - j m ?r m l5 msk /p{g^\C)x 

4>(m - 1, s, •, s, •MsWmI' s > fa s > fc > D, 

PiRimsk = 1, Jims = (% W ,£*) = (1- Jm) 2 rm(l ~ r m ) Pmsk / P^ \C) x 

[ ^2 <f>{m- l,s,-,s 2 ,k 2 )p(g ( £. M \s,k,s 2 ,fa,C) 

S2M 

si,fei 
P(Rimsk = l,Jims = l\g {l \C) = jm(l ~ jm)rmPmsk/p{9® ID X 

0(m- 1,S, -,-,•) ^ ai!, ) /3 mS2 A :2 p(5m:Ml' S ' fc ' S 2 ; fc 2,D 
«2,fc2 

+ 0(m- 1,T,S, •) ^ ai^msifeiM^-M^l'^l'*'^^* 

Sl,fel 
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(35) 



(36) 



(37) 



(38) 



(39) 



Finally, special treatment is needed at marker m = 1. For each s, k set 

E[RiUk\g®,C] = tx 8 Pi Bk p(g { £ M \8,k,;;C) (40) 

and renormalize such that J2 s k-E[Rilsk\9 >£*] = d, where d = 2, 1 for diploid and haploid indi- 
viduals respectively. Set E[Jn s \g^\C] = Y.k E [ R Hsk\g^\C]- 



5.4.4 Ancillary HMM 

The expected complete data log-likelihood is given below. 



E, 



W\h,g£* 



K M 
j=l m=l 



M K 

Y Y log p ( 0m 3 1 r?ms ' £* } p ' 

m=l j=l 



mjs 



(41) 



where p m js is the s-th upper cluster dosage of of j-th haplotype at marker m. From Balding-Nichols 
model [35], we have 



p(0 m j\n 



m] | 'ImsJ 



nFrim.s - 1 . 

B{Fr] ms ,F(l-ri ms )) m i 



i-e m] ) F ^-^-\ 



(42) 



Combine above two equations and drop the m in notation, we have for an arbitrary marker 

K 

f(9j, r, s ) = Y, [- log B(F Vs , F(l - Va )) + (F Vs - 1) logfl, + (F(l - r, 3 ) - 1) log(l - O^p.,,. 



i=i 



_d_ 
d~6, 



f(6j,Vs 



Frj s - 1 F(l - rj s ) - 1 

e, (1-0,) 



P-js 



(43) 



This suggest that we add (Frj s — l)p.j s to the top and (F — 2)p.j s to the bottom of (32 ) to estimate 
9j. 



d 
dr]t 



K 



f(6 J , Vs ) = Y 



i=i 



1 



d 



BiFri^Fil-ris^dr], 



B(F Vs ,F{l-n s ))+Flog 



1 



p.j t 



K 



K 



(44) 



■js 



= -Fj^Pjs inFVs) - r(F(l - r, s ))] + F £ log ^-f-p 

3=1 j=l J 

where T is a digamma function. When F > 1, we use recurrence relation r(x + 1) = 1/x + T(x 
twice to get 

r(Fr/ s ) = T(Fr ?s + 2 



1 



1 



F ^ + 1 ^ , (45) 

r(F(l-r, s ))=T(F(l- Vs ) + 2) 



F(l - Vs ) + 1 F(l - Vs 



Because r\ G [0, 1], we may use 



i _L i 



j i^ j ^ _i_ 



exp(r(x)) x """ 2x^ """ 4-3!x 3 ~ 2-4!-x 4 ~ 48-5!-a; 



+ 



17 



3- at x = Fr/ S + 2 and 



:r = .F(l — rj s ) + 2 to solve for ^ s numerically. When F = 1, however, we may use the refection 
formula T(l — r/ s ) — T(t/ s ) = -zrcot (7rr/ s ) to solve for r] s analytically. 

The forward and backward probabilities of the ancillary HMM and other parameter estimates 
are simply special cases of the main HMM. 
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6 Supplementary Figures 
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True Admixture Proportion 



Figure SI: Inference of admixture proportion. The x-axis denotes the truth of admixture pro- 
portions, and the y-axis denotes inferred values. The gray line indicates x = y. For a three-way 
admixed individual there are three numbers (that sum to 1) to denote admixture proportions. For 
admixture events that happened recently (7 = 10,20), the inference of admixture proportions is 
very accurate; for remote admixture events (7 = 100), our method slightly over-estimates large 
admixture proportions and slightly under-estimates the small ones. 
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Figure S2: Detailed comparison with LAMP-LD. The plots shows the comparison for a typical 
simulated individual using g = 20. Each panel denotes an ancestry, on which we plot the local 
ancestry of the actual (black line) , our inference (blue line) , and LAMP-LD inference (pink line) . 
Compare to our method, LAMP-LD makes more mistakes on regions of a few hundred SNPs. 
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Figure S3: Regions that exhibit excessive European ancestry in Mexicans. The y-axis is the average 
European ancestry allele dosages (over 58 Mexican samples). The g in legends means 7 in main 
text. The inlets are close-ups of top peak(s) in each panel. The top panel contains two peaks, 
one is in lp36.31 from 6.25Mb to 6.55Mb with average dosage of 1.63, and the other is in lq44 
from 246.00Mb to 246.40Mb with average dosage of 1.65; The middle panel contains one peak in 
12pl3.32 from 8.66Mb to 9.92Mb with average dosage of 1.72; the bottom panel contains one peak 
in 18q21.1 from 44.18Mb to 44.51Mb with average dosage of 1.78. 
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